System and method for empirical organizational cybersecurity risk assessment using externally-visible data

ABSTRACT

A system and method for assessing the cybersecurity breach risk associated with a given organization is disclosed. The system and method assume no internal visibility into any organizational network. A taxonomy of possible data sources is defined and motivated. The system and method are both purely empirical and robust against common difficulties in scoring organizational networks, such as the raw number of network assets owned by the organization.

TECHNICAL FIELD

The subject matter described herein relates to predictive analytics in the setting of multi-class classification. More specifically, this disclosure relates to a system and method for assessing cybersecurity breach risk for an organization that is associated with a set of public network assets.

BACKGROUND

Cybersecurity breaches consisting of theft of sensitive information have become increasingly common, difficult to detect, and costly. Virtually all organizations have some online presence, while many of these organizations do not rigidly follow best security practices such as endpoint configuration and monitoring, SSL certificate maintenance, and isolation of sensitive information. Even ostensibly innocuous organizations are seen as increasingly worthwhile targets among cyber criminals, both because of the increasing ease with which they can be breached and because of the increasing likelihood that a breach will yield information saleable on the black market and will take a long time to detect.

For these reasons, the market for cybersecurity incident insurance among organizations has grown substantially. As both the cybersecurity market and the actual costs incurred from cyber breaches have grown, there is a demand for an empirically derived measure of incident risk, and a need for a new cybersecurity risk score associated with the likelihood that the organization will suffer a breach, the expected value of its confidential information, and the expected potential damage the organization might suffer due to exposure of its confidential information.

In the past, the measurement of an organization's overall likelihood to suffer a cybersecurity breach has been driven by expert questionnaires, penetration testers, audits, and heuristic measurements. Although the resulting risk assessments may provide substantial predictive power, they are inherently subjective. The canonical approach to producing an empirical measure of risk has been to rely on fine-grained measurements of network traffic internal to the organization, which allows for arbitrarily fine detail; however, in practice, both security concerns and configuration challenges have hindered efforts to produce a robust, generic measure of cyber risk based on data of this nature.

Accordingly, what is needed is a technique for measuring an organization's overall likelihood to suffer a cybersecurity breach using fully empirical methods, which can model an expected value of the organization's confidential information, and an expected potential damage the organization might suffer due to exposure of its confidential information by the cybersecurity breach.

SUMMARY

This document presents a computer-implemented predictive analytics system and method in the setting of cybersecurity risk assessment of a specified organization.

In some aspects, systems and methods are described by which a cyber breach risk of an organization is assessed by empirical, external means. In some implementations, the problem of assessing the risk of large networks that are mostly well-run is addressed by characterizing the risk of the organization at large solely, or more strongly, in terms of the weakest network prefix within that organization. Accordingly, the described system and method disclose a novel technique by which to score individual network prefixes, and a method by which to combine the score associated with the riskiest prefix(es) to generate an overall organization score.

In preferred implementations, the specified organization is, by other means, associated with a set of IPv4 and/or IPv6 network blocks owned by the organization in question. FIG. 2 demonstrates this process assuming manual intervention in the association process. A collected dataset, consisting of potentially multiple historical observations of various characteristics associated with each individual address within one of the previously associated network blocks, is presented to an aggregation processor which executes an algorithm to calculate multiple features indicative of cybersecurity breach risk. A subset of these features is calculated by aggregating all records within the dataset, producing “organization features”, while additional subsets are calculated by aggregating only those records within the dataset associated with a specific network, producing “network features”. Such an additional subset of features is included among the broader set of features for every network comprising the organization, yielding a set of network features for each. The multiple organization features are then used as inputs to an analytic organization model which calculates a security probability score, while the multiple sets of network features are used as inputs to a network model, by which each individual network is assigned its own security probability score. The probability score corresponding to the organization model is then combined in a predetermined manner with the set of probability scores corresponding to the network model to produce a final “Score Response” consisting both of a numeric cybersecurity breach risk score leveraging both organizational and network features and a ranked list of factors contributing to that score based on these inputs for use in remediation using the score.

In such embodiments, the Score Response represents the cybersecurity breach risk and contributing factors for the organization in question. In addition to reflecting information derived from the individual network scores and features (explicitly via the reason codes, and implicitly via the score itself), the Score Response may be augmented by some or all of the individual network scores or reason codes themselves. Whether and how such augmentation occurs, including the process by which the networks whose scores and reason codes are added to the Score Response are selected and the specific information relevant to these networks is determined, may vary based on the implementation in question.

In one aspect, a computer-implemented method includes the step of scoring, by at least one data processor, individual network prefixes of a computer network of computer endpoints associated with an organization. The scoring is processed according to a cybersecurity breach risk scoring model executed by the at least one data processor on a dataset associated with the individual network prefixes, the scoring generating a score representing a risk of a cybersecurity breach of each of the individual network prefixes. The method further includes the step of aggregating, by the at least one data processor, the one or more network prefixes of the computer network of computer endpoints associated with the organization into an aggregated computer network dataset. The method further includes the step of scoring, by the at least one data processor, the aggregated computer network dataset. The scoring is processed according to a cybersecurity breach risk scoring model executed by the at least one data processor on the aggregated computer network dataset and one or more riskiest of the individual network prefixes to generate an overall organization score, the riskiest of the individual network prefixes being determined based on a threshold score determined by the at least one data processor according to the cybersecurity breach risk scoring model.

In another aspect, a method, computer program product, and system execute a process including steps of calculating, by at least one data processor, a risk of a cybersecurity breach for each of a plurality of network blocks associated with an organization, each of a plurality of network blocks having an individual network address of a network associated with the organization, the calculating being based on a plurality of historical observations of one or more features associated with each of the plurality of network blocks. The process further includes aggregating, by the at least one data processor, all records in the dataset associated with organization features to calculate a first subset of the one or more features associated with the plurality of network blocks, and aggregating, by the at least one data processor, all records in the dataset associated with network features to calculate a second subset of the one or more features associated with each of the plurality of network blocks. The process further includes calculating, by the at least one data processor according to an analytic organization model, an organizational cybersecurity breach probability score with the aggregated records associated with the organizational features, and calculating, by the at least one data processor according to a network model, a network cybersecurity breach probability score with the aggregated records associated with the network features. The process further includes combining, by the at least one data processor, the organizational cybersecurity breach probability score and the network cybersecurity breach probability score to generate a score response comprising a numeric cybersecurity breach risk score that leverages both the organizational features and the network features and a ranked list of factors contributing to their respective scores.

Implementations of the current subject matter can include, but are not limited to, systems and methods consistent with the disclosure herein, as well as articles that comprise a tangibly embodied machine-readable medium operable to cause one or more machines (e.g., computers, etc.) to result in operations described herein. Similarly, computer systems are also described that may include one or more processors and one or more memories coupled to the one or more processors. A memory, which can include a computer-readable storage medium, may include, encode, store, or the like one or more programs that cause one or more processors to perform one or more of the operations described herein. Computer implemented methods consistent with one or more implementations of the current subject matter can be implemented by one or more data processors residing in a single computing system or multiple computing systems. Such multiple computing systems can be connected and can exchange data and/or commands or other instructions or the like via one or more connections, including but not limited to a connection over a network (e.g. the Internet, a wireless wide area network, a local area network, a wide area network, a wired network, or the like), via a direct connection between one or more of the multiple computing systems, etc.

The details of one or more variations of the subject matter described herein are set forth in the accompanying drawings and the description below. Other features and advantages of the subject matter described herein will be apparent from the description and drawings, and from the claims. While certain features of the currently disclosed subject matter are described for illustrative purposes in relation to an enterprise resource software system or other business software solution or architecture, it should be readily understood that such features are not intended to be limiting. The claims that follow this disclosure are intended to define the scope of the protected subject matter.

DESCRIPTION OF DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of this specification, show certain aspects of the subject matter disclosed herein and, together with the description, help explain some of the principles associated with the disclosed implementations. In the drawings,

FIG. 1 is a process flow diagram illustrating a method executed by a system for empirical organizational cybersecurity risk assessment using externally-visible data;

FIG. 2 illustrates a method incorporating the process flow of FIG. 1, and by which users interact with the system; and

FIG. 3 illustrates a graphical user interface to provide a graphical representation of the methods described herein for providing empirical organizational cybersecurity risk assessment.

When practical, similar reference numbers denote similar structures, features, or elements.

DETAILED DESCRIPTION

Recent technological advances rendering internet-wide network scanning feasible, on a short timescale and with minimal infrastructure overhead, have stimulated entirely new lines of inquiry regarding the potential applications of detailed analysis of public-facing network assets in general. An immediate application of such analyses is to measure the degree of vulnerability associated with a set of public-facing network assets, to establish that it is related to the overall security posture of the organization owning those assets, and to measure precisely the severity of cyber risk to which the organization is exposed based on its security posture. In particular, not only are externally observable characteristics of an organization's network informative in forming such an empirical assessment, but they also may be used to predict a probability that the organization in question will suffer a cybersecurity breach incident in the future, as described in more detail below.

In accordance with systems and methods described herein, a rich variety of externally observable data may be collected with relative ease owing to the aforementioned recent technological advances, and among these data are signals indicating a wide range of aspects of the security posture associated with the organization owning the network assets in question. Signals derived from internet-wide scans can be characterized as illustrative of an organization's disposition in its capacity to regulate its own security posture. Similarly, network assets that appear to engage in malicious or illicit activity such as spam or phishing are characterized as indicative of the owning organization's disposition with respect to its vigilance and ability to modulate the activities occurring on devices on its network. These characterizations can be powerful predictors of cyber breach risk.

Given the richness and volatility of the ecosystem around vulnerability exploits, it is unlikely that any specific probe of an organization's public-facing network assets could be expected to convey no information whatsoever regarding the security posture of the organizations owning those assets. However, some probes are more informative from the breach risk perspective than others.

For example, Domain Name System (DNS) resolvers that respond to public queries are themselves indicative of risk insofar as they increase attack surface, but the subset of those resolvers that are configured to allow recursive DNS resolution requests is much more strongly indicative of risk. Open recursive DNS resolution is well known by the security community to enable amplification of distributed denial of service (DDoS) campaigns, and is considered a poor security practice. While the organization owning an open recursive resolver might not necessarily become the target of such an attack, it is very rare that an organization has a legitimate need to run an open recursive resolver; far more frequently it is the result of a misconfiguration indicating a poor organizational security posture.

By the same token, the sheer number of network assets owned by an organization that engage in secure sockets layer/transport layer security (SSL/TLS) handshakes is indicative of risk; the fraction of these network assets that are misconfigured in an observable manner is far more so. In particular, servers presenting clients with an invalid certificate chain as part of the handshake process are badly misconfigured because they negate the primary benefit of TLS from the authentication perspective. Hypertext Transfer Protocol over a connection encrypted by Transport Layer Security (HTTPS) is perhaps the most ubiquitous use of TLS; thus misconfiguration in this context is especially egregious. Proper configuration in this context entails that the certificate chain presented to the client by the server consist entirely of certificates signed by a trusted authority and that the root certificate be signed by a trusted root certificate authority. Certificate chains may be invalid because of certificate expiration, an untrusted issuing authority, or a self-signed certificate, to name a few causes.

Several additional targeted probes can be used in informing the assessment of the security posture of an organization. For example, the number of servers responding to Network Time Protocol (NTP) requests in general, and the subset of these that respond to specific requests in such a way as to suggest that they are susceptible to well-known exploits, can be indicative of mismanagement. Similarly, configuring network assets to respond to Internet Control Message Protocol (ICMP) echo requests, despite being unrelated in the immediate sense to any particular security concern specific to the asset itself, is a wholly unnecessary configuration setting, which can allow an attacker to map out the organization's network in order to identify weaknesses, and which is trivially modified. Thus the number of such assets reflects the degree to which best practices are adhered to by the network security administration within the organization, and reflects the relative ease with which a determined attacker might discover weak points in the organization's network.

In addition to the targeted probes described above, a separate class of signals is based on several reputation blacklists, darknet monitors, and other sources of lists identifying network assets that have been observed engaging in suspicious activity, such as scanning, phishing schemes, and spam email. This class of signals comprises externally visible information that may be associated with network assets on a global basis, that is suggestive of breach risk, and that is also available to would-be cyber attackers.

While the measurements described above may provide an accurate measure of organizational security risk, they all rely on the organization in question having been associated with a set of owned network assets. However, many organizations do not own any network assets but nevertheless have a tangible internet presence; for example, such an organization may run its public-facing infrastructure on a third-party public cloud platform such as Amazon Web Services (AWS), in which all public internet protocol addresses (IPs) are owned by Amazon despite the corresponding assets being managed by the organization. Such organizations are not suitable for scoring with a model relying wholly on asset-based data sources. In these cases, an informative score may yet be computed by examining the organization's demographic information such as total number of employees, total annual sales, industry classification codes (e.g. SIC), geographic location of the headquarters, number of physical corporate locations, etc.

Among the many fundamental difficulties involved in producing such a risk assessment are the questions of how the size of the organization's attack surface is measured, and what this measurement suggests about its likelihood of suffering a breach. In the most tangible aspect of attack surface, intuition suggests simply that risk scales linearly in the size of the network (as reflected by the number of routable IP addresses, blocks, or endpoints within these blocks) because each additional endpoint (or pool of endpoints within a block) represents a new independent vector for penetration. Such an approach is clearly naïve: for example, a large organization with excellent security practices may maintain large numbers of secure endpoints on static routable IP addresses, all managed uniformly by a single central administrator. In this case, it is unlikely that an increase in the number of endpoints would imply an increase in breach risk. On the other hand, the same organization's breach risk would increase dramatically if even a very small number of poorly-managed endpoints were added to its collection of network assets.

The system and method described in this document consist of an algorithm by which an organization's cybersecurity breach risk score may be calculated. Specifically, this disclosure relates to a multi-faceted process comprising calculation both of an “overall” score and of a set of “network” scores, each calculated separately using an individual component network associated with that organization. The component networks necessarily comprise a mutually-disjoint set of address blocks that covers the organization's address space. The organization in question is presumed to have been associated with at least one contiguous network block consisting of at least one network address.

The process by which the organization in question is associated with the at least one network block may necessitate extensive use of regional internet registries or other internet governing bodies, as well as data provided by commercial vendors for this purpose specifically. This association helps identify all network addresses owned or otherwise controlled by the organization in question. Furthermore, any of the score and/or feature calculations described herein may make use of additional salient information, such as information from internet governing organizations and commercial vendors, identified, associated, prepared, and processed via any of a number of known mechanisms. Such information need not necessarily be associated with a set of network addresses or blocks but potentially with the organization itself; for example, SIC code or organization mailing address may be used in calculating features as described in this disclosure, in which case the features and risk scoring method and implementation are in the scope of this disclosure.

FIG. 1 is a process flow diagram illustrating a method executed by a system for empirical organizational cybersecurity risk assessment using externally-visible data, and by which an organization is assigned a score according to an analytic model. The organization's network addresses (101) are supplied in advance and are used to collect a dataset (105) of externally-visible raw data corresponding to each of these addresses or to the organization as a whole. The network data (105) is then aggregated (106) into a single set (107) containing all raw data records, the entirety of which is used (108) to generate a set of risk scoring features (109), which are in turn used as input variables to an analytic model (110) to obtain a risk score and associated reason codes (111). Simultaneously, the feature generation process (112) is performed on the organization's network data (105) for each individual network separately, resulting in a set of generated risk scoring features (113) for each individual network within the organization. The features within the feature set (113) are used to update a set of feature quantile estimates (114) maintained by the scoring system and are in turn rescaled by the current quantile estimates, resulting in a set of features suitable for scoring by an analytic model (115), resulting in a score and set of reason codes for each network (116). Finally, the score and reason codes (111) for the organization's network as a whole are combined (117) with the set of scores and reason codes (116) for each individual network to produce a final overall organization score, set of network scores, and reason codes (118). In some implementations, a method uses all addresses within the organization's associated network blocks to retrieve records from a database (104) containing observations regarding these and other addresses, or to create and insert new such records dynamically by accessing the internet (103).
For example, such a database may contain results of a port scan in which all addresses owned by the organization (and possibly additional addresses) are scanned daily in order to determine which, if any, of these addresses appear to be routed to a host on which any or all of a particular set of transmission control protocol (TCP) and/or user datagram protocol (UDP) ports are open, or appropriate requests are being responded to. Similarly, a record that is created dynamically may be the result of a new port scan performed contemporaneously. Such dynamically-created records may furthermore be inserted into the database in question in real time, obviating the distinction between these two mechanisms of record retrieval in the context of this disclosure.

Each record retrieved from the database contains results of a set of measurements recorded for a single address at any given time. Multiple records for a single given address, corresponding to different measurement dates, may therefore be among the records retrieved via this process. Once the set of records has been retrieved, an aggregation process can be performed to generate a set of features suitable for use as inputs to an analytic risk scoring model based on the entire set of retrieved records, as described in more detail below. In preferred implementations, the specified organization is associated with a set of IPv4 and/or IPv6 network blocks owned by the organization in question.

FIG. 2 illustrates a method incorporating the process flow of FIG. 1, and by which users interact with the system, assuming manual intervention in the association process. The organization name and domain (201) are used to retrieve network blocks and additional organization-level data (202), such as SIC code, organization name, and RIR entries, which are returned to the user for a decision (203) as to whether to associate these data with the organization to be scored. The resulting selected data are used by the model (204) to produce a score response as in FIG. 1, the result of which is displayed (205) to the user. In particular, in the embodiment shown in this figure, the user provides certain information (such as the organization name) regarding the organization(s) to be scored by the process shown in FIG. 1, then manually verifies or modifies the network asset assignment and demographic entity resolution applied algorithmically as part of the system and method. Subsequently, the organization(s) are scored and the results delivered to the user by means of a graphical user interface, such as a web interface as viewed by a browser.

A dataset, collected by any of a number of mechanisms such as a global internet port scan performed daily over the course of several years, and including one or more historical observations of various characteristics associated with each individual address within one of the previously associated network blocks, is presented to an aggregation algorithm which calculates multiple features indicative of cybersecurity breach risk; these features are described in further detail below. A subset of these features is calculated by aggregating all records within the dataset, producing “organization features”, while additional subsets are calculated by aggregating only those records within the dataset associated with a specific network, producing “network features”. Such an additional subset of features is included among the broader set of features for every network comprising the organization, yielding a set of network features for each.

The multiple organization features are then used as inputs to an analytic organization model which calculates a security probability score, while the multiple sets of network features are used as inputs to a network model, by which each individual network is assigned its own security probability score. The probability score corresponding to the organization model is then combined in a predetermined manner with the set of probability scores corresponding to the network model to produce a final “Score Response” at 117 consisting both of a numeric cybersecurity breach risk score leveraging both organizational and network features and a ranked list of reason codes indicating the factors contributing to that score based on these features for use in remediation using the score.
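The disclosure leaves the "predetermined manner" of combination open. The following is only a minimal sketch of one possibility, assuming that networks at or above a threshold score count as the riskiest and that the riskiest network score is blended with the organization score by a fixed weight. The function name `combine_scores`, the threshold, the blend weight, and the shape of the returned dictionary are all hypothetical and are not taken from this disclosure.

```python
def combine_scores(org_score, network_scores, threshold=0.5, network_weight=0.5):
    """Blend an organization score with its riskiest network score.

    Hypothetical combination rule: the disclosure only states that the
    scores are combined "in a predetermined manner".
    """
    # Keep networks whose score meets the (assumed) riskiness threshold,
    # ranked from most to least risky.
    riskiest = sorted(
        ((name, score) for name, score in network_scores.items() if score >= threshold),
        key=lambda item: item[1],
        reverse=True,
    )
    if riskiest:
        worst = riskiest[0][1]
        overall = (1.0 - network_weight) * org_score + network_weight * worst
    else:
        # No network is risky enough to influence the overall score.
        overall = org_score
    return {"score": overall, "riskiest_networks": riskiest}


response = combine_scores(
    org_score=0.25,
    network_scores={"192.17.0.0/16": 0.5, "206.123.112.224/27": 0.75},
)
```

Under this assumed rule, a single badly managed prefix raises the overall score even when the organization-wide features look healthy, which matches the stated aim of weighting the weakest prefix more strongly.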

In such embodiments, the Score Response represents the cybersecurity breach risk and contributing factors for the organization in question. In addition to reflecting information derived from the individual network scores and features (explicitly via the reason codes, and implicitly via the score itself), the Score Response may be augmented by some or all of the individual network scores or reason codes themselves. Whether and how such augmentation occurs, including the process by which the networks whose scores and reason codes are added to the Score Response are selected and the specific information relevant to these networks is determined, may vary based on the implementation in question.

In some implementations, the database in question may contain the results of daily probes indicating, for each IP address, whether or not it is configured to respond to a recursive DNS request. For example, consider a fictional organization named Acme Inc. which owns the address blocks 192.17.0.0/16 (65,536 addresses) and 206.123.112.224/27 (32 addresses). The database will be queried for all records corresponding to all addresses in these ranges. For each daily measurement, the database will contain 65,568 records. Over a hypothetical three-day window, the returned dataset in its entirety will consist of 196,704 records containing three columns: “Address”, “Date”, and “Responds to DNS recursive query”. Those skilled in the art will recognize that several possible aggregation schemes may be implemented in order to generate feature values given such a set of input records. In one such implementation, the dataset is grouped by address, and a new Boolean value, called “Responds to DNS recursive query on any date”, is calculated for each address by taking the logical OR of that address's three daily values in the Boolean column “Responds to DNS recursive query”. The result is a 65,568-record intermediate dataset consisting of two columns, “Address” and “Responds to DNS recursive query on any date”, which may then be converted into a meaningful risk feature by calculating the ratio of the number of records for which the Boolean column is “True” to the total number of records. Many additional risk features may be calculated in similar fashion; for example, one may replace the OR operation used in constructing the final Boolean field with an AND operation, resulting in a field one might entitle “Responds to DNS recursive query on every date”.
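The aggregation in this example can be sketched as follows. The record layout (address, date, Boolean) follows the three columns named above; the function name `fraction_responding`, the four-address toy dataset, and the dates are illustrative assumptions rather than part of the disclosure.

```python
from collections import defaultdict
from ipaddress import ip_network


def fraction_responding(records, combine=any):
    """Fraction of addresses whose daily flags satisfy `combine`.

    records: iterable of (address, date, responds) tuples.
    combine: `any` gives "on any date" (OR); `all` gives "on every date" (AND).
    """
    flags_by_address = defaultdict(list)
    for address, _date, responds in records:
        flags_by_address[address].append(bool(responds))
    if not flags_by_address:
        return 0.0
    per_address = [combine(flags) for flags in flags_by_address.values()]
    return sum(per_address) / len(per_address)


# Address counts for the two Acme Inc. blocks in the example.
blocks = [ip_network("192.17.0.0/16"), ip_network("206.123.112.224/27")]
addresses_per_day = sum(block.num_addresses for block in blocks)
records_over_three_days = 3 * addresses_per_day

# Toy dataset: four addresses observed on three (hypothetical) dates.
days = ["2020-01-01", "2020-01-02", "2020-01-03"]
daily_flags = {
    "192.17.0.1": [False, False, False],
    "192.17.0.2": [True, True, True],
    "192.17.0.3": [False, True, False],
    "206.123.112.225": [False, False, False],
}
records = [
    (address, day, flag)
    for address, flags in daily_flags.items()
    for day, flag in zip(days, flags)
]

any_date_ratio = fraction_responding(records, combine=any)
every_date_ratio = fraction_responding(records, combine=all)
```

Swapping `any` for `all` is the only change needed to move from the "on any date" feature to the "on every date" feature.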

In the same implementation, several such features are generated from the complete collection of records corresponding to all addresses in blocks owned by Acme Inc. Independently, the same collection of records is partitioned according to its original network membership within the set of networks owned by the organization, resulting in one set of records for each of the networks owned by the organization. In the same example, this procedure will result in two distinct datasets: one for 192.17.0.0/16, consisting of 65,536 records for each daily observation, and another for 206.123.112.224/27, consisting of 32 records for each daily observation. The overall result is a set of features calculated using records corresponding to all network addresses within the organization and multiple additional sets of features, each set corresponding to a particular network owned by the organization and calculated using records corresponding only to the subset of addresses within the network in question. The fields used in the input records, the aggregation algorithm, and the process by which the features for the organization overall are calculated need not be identical to those by which the features for each individual network are calculated, although they may be chosen to be identical.
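The partitioning step can be sketched with Python's standard `ipaddress` module. The helper name `partition_by_network` and the sample records are hypothetical; the sketch assumes each record begins with an address string and that the organization's networks are mutually disjoint.

```python
from ipaddress import ip_address, ip_network


def partition_by_network(records, networks):
    """Split records into one list per owning network.

    records: iterable of tuples whose first element is an address string.
    networks: CIDR strings for the organization's disjoint networks.
    """
    parsed = [ip_network(cidr) for cidr in networks]
    partitions = {str(net): [] for net in parsed}
    for record in records:
        address = ip_address(record[0])
        for net in parsed:
            if address in net:
                partitions[str(net)].append(record)
                break  # networks are assumed disjoint, so stop at first match
    return partitions


acme_networks = ["192.17.0.0/16", "206.123.112.224/27"]
sample_records = [
    ("192.17.0.2", "2020-01-01", True),
    ("192.17.255.254", "2020-01-01", False),
    ("206.123.112.225", "2020-01-01", False),
]
by_network = partition_by_network(sample_records, acme_networks)
```

Each resulting list can then be passed to the same aggregation routine used for the organization as a whole, or to a different one.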

Once calculated, each of the several sets of features is used as input variables to a specified predictive model, the specific process depending in general upon whether the set of features in question corresponds to all or only one of the organization's networks, and the model is used to produce a probability score reflecting the overall cybersecurity posture corresponding to the network used to calculate the features. The features calculated from each individual network separately are calibrated via empirical quantile estimation, in which the quantiles for each individual feature may be estimated in near-real time for all measurable networks corresponding to established organizations.
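The disclosure does not specify how near-real-time quantile estimation is performed. One common choice, shown here purely as an assumed illustration, is a stochastic-approximation update that nudges the running estimate up when a new observation exceeds it and down otherwise. The function name `update_quantile`, the constant step size, and the uniform test stream are all hypothetical.

```python
import random


def update_quantile(estimate, observation, p, step):
    """One stochastic-approximation update of a running p-quantile estimate.

    The estimate rises by step * p when the observation lies above it and
    falls by step * (1 - p) otherwise, so it settles where a fraction p of
    observations fall at or below it.
    """
    indicator = 1.0 if observation <= estimate else 0.0
    return estimate + step * (p - indicator)


# Track the 95th percentile of a stream of uniform(0, 1) feature values.
rng = random.Random(0)
estimate = 0.5
for _ in range(20000):
    estimate = update_quantile(estimate, rng.random(), p=0.95, step=0.002)
```

A constant step keeps the estimate responsive to drift in the underlying distribution at the cost of some residual noise; a decaying step would trade responsiveness for precision.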

The algorithm used in this context applies quantile estimation, variable scaling, and a scoring algorithm in analogy to Multi-Layered Self-Calibrating models. Specifically, the values of each asset variable individually are computed for all network prefixes in all organizations exceeding a certain threshold with respect to risk as measured by the overall organization score, and a pair of lower and higher quantile values, $\theta_L$ and $\theta_H$ respectively, both intended to reside in the tail of the distribution, is recorded for each variable. As time passes, the quantiles may be updated to reflect any changes in the underlying distribution either by a batch process as described above or by a recursive online quantile estimation algorithm.
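The text does not specify which recursive online quantile estimator is used. As one possible sketch, the stochastic-approximation update below nudges a running estimate up or down after each observation. The function names, the fixed step size, the chosen quantile levels (0.90 and 0.99), and the synthetic uniform data are all assumptions made for illustration.

```python
import random


def update_quantile(estimate, observation, p, step):
    """One stochastic-approximation step toward the p-th quantile."""
    indicator = 1.0 if observation <= estimate else 0.0
    # Moves up by step*p when the observation exceeds the estimate,
    # and down by step*(1-p) otherwise; the two balance at the p-th quantile.
    return estimate + step * (p - indicator)


def estimate_quantile(stream, p, step=0.002):
    """Run the recursive update over a stream, seeded with the first value."""
    iterator = iter(stream)
    estimate = next(iterator)
    for observation in iterator:
        estimate = update_quantile(estimate, observation, p, step)
    return estimate


# Synthetic stand-in for one asset variable observed across many networks.
rng = random.Random(0)
values = [rng.random() for _ in range(20000)]

theta_low = estimate_quantile(values, 0.90)   # lower tail quantile
theta_high = estimate_quantile(values, 0.99)  # higher tail quantile
```

A fixed step lets the estimate track drift in the underlying distribution at the cost of some residual noise; a decaying step would converge more tightly on stationary data.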

After having established the most recent values of $\theta_L$ and $\theta_H$, the scoring process consists of two steps: First, the variables are calculated as usual based on the specific prefix or sub-asset in question, then are scaled based on their quantile values. Specifically,

$q^{(i,\theta)}\left( x^{(i)} \right) = \frac{x^{(i)} - \theta_{L}^{(i)}}{\theta_{H}^{(i)} - \theta_{L}^{(i)}},$

where $x^{(i)}$ is the raw value of the $i$th variable, $q^{(i,\theta)}$ is the scaled value of the $i$th variable, $\theta_{L}^{(i)}$ and $\theta_{H}^{(i)}$ are respectively the low and high quantiles of the $i$th variable, and $\theta$ is the vector of current high and low quantile estimates across all variables. In some implementations, the score calculation is:

$\eta = \sum_{i=1}^{p} \min\left( w^{(i)} \, q^{(i,\theta)}\left( x^{(i)} \right), C \right),$

where $p$ is the total number of variables used in this embodiment of the score calculation, $w^{(i)}$ is either a manually-tuned or learned weight for the $i$th variable, and $C$ is a capping constant. In various implementations, the current high and low quantiles comprising $\theta$ for the $i$th variable $x^{(i)}$ may be measured or estimated based on the distribution of $x^{(i)}$ either across all networks globally or individually by segmenting networks or organizations according to certain criteria. For example, the high and low quantile estimates may be measured or estimated separately for each SIC code assigned to an organization, and subsequently applied in scaling only to those networks with that SIC code. Other possible criteria by which quantile estimates and scaling may be segmented in this manner are SIC group, industry classification, number of employees, annual revenue, number of physical locations, number of vendor relationships, geographical location, and many others.
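A small worked sketch of the two-step calculation, read here as (1) scaling each raw variable by its low and high quantile estimates and (2) summing the weighted scaled values with each term capped at a constant. The function names, feature names, quantile values, weights, and cap are hypothetical, and the exact placement of the cap is an interpretation of the description rather than a confirmed detail.

```python
def scale(x, low, high):
    """Scale a raw value by its low and high quantile estimates."""
    if high <= low:
        raise ValueError("high quantile must exceed low quantile")
    return (x - low) / (high - low)


def score(raw, quantiles, weights, cap):
    """Sum the weighted scaled variables, capping each term at `cap`."""
    total = 0.0
    for name, x in raw.items():
        low, high = quantiles[name]
        total += min(weights[name] * scale(x, low, high), cap)
    return total


# Hypothetical variables for one network prefix.
raw = {"expired_certs": 8.0, "open_services": 30.0}
quantiles = {"expired_certs": (2.0, 10.0), "open_services": (10.0, 20.0)}
weights = {"expired_certs": 1.0, "open_services": 2.0}

# expired_certs: (8 - 2) / (10 - 2) = 0.75, weighted 0.75, below the cap
# open_services: (30 - 10) / (20 - 10) = 2.0, weighted 4.0, capped at 1.5
result = score(raw, quantiles, weights, cap=1.5)
```

Capping keeps a single extreme variable from dominating the sum. Note that this sketch does not floor negative scaled values at zero, so variables below their low quantile reduce the total.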

The features having been calibrated to a mutually-consistent scale, the feature sets corresponding to various networks within the organization, and the scores based on the output of the analytic models corresponding to these, may be used to draw meaningful comparisons among different networks both on the basis of score and individual features. Furthermore, because the calibration applied is based on quantile estimates derived either from all networks globally or from all networks within the same predefined segment(s) such as SIC group or legal status, similar comparisons may be drawn among networks both within the same organization and otherwise.
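Segment-level calibration can be sketched as a batch computation of low and high quantiles per segment. The example below groups observations of one variable by SIC code and takes percentile cut points with the standard library. The function name, the 90th and 99th percentile choice, and the sample data are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import quantiles


def segment_quantiles(observations, low_pct=90, high_pct=99):
    """Return {segment: (low, high)} percentile estimates for one variable.

    `observations` is an iterable of (segment, value) pairs; each segment
    needs at least two values for `statistics.quantiles` to succeed.
    """
    by_segment = defaultdict(list)
    for segment, value in observations:
        by_segment[segment].append(value)
    result = {}
    for segment, values in by_segment.items():
        cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
        result[segment] = (cuts[low_pct - 1], cuts[high_pct - 1])
    return result


# Hypothetical observations keyed by SIC code (codes chosen for illustration).
observations = [("7372", v) for v in range(101)]
observations += [("6022", 2 * v) for v in range(101)]
per_segment = segment_quantiles(observations)
```

Scaling a network's variables with the quantile pair for its own segment puts networks from the same segment on a common scale, so that their features and scores can be compared directly.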

The score for each individual sub-asset within an organization is calculated as above and these scores are then combined into a single score. In some embodiments, the result of this combining process is simply the single score indicating highest risk across all sub-asset scores. In this manner the organization risk score is based strongly or entirely on the single riskiest sub-asset within the network, reflecting that sub-asset's role as the “weakest link” in the organizational security posture.
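The "weakest link" combination reduces to taking the highest-risk sub-asset score. The sketch below assumes that a larger score means higher risk; the function name and the sample scores are hypothetical.

```python
def combine_sub_asset_scores(scores):
    """Return (sub_asset, score) for the highest-risk sub-asset.

    Assumption: larger scores indicate higher risk.
    """
    if not scores:
        raise ValueError("at least one sub-asset score is required")
    riskiest = max(scores, key=scores.get)
    return riskiest, scores[riskiest]


# Hypothetical per-prefix scores for one organization.
sub_asset_scores = {"192.17.0.0/16": 0.42, "206.123.112.224/27": 0.87}
riskiest_prefix, organization_score = combine_sub_asset_scores(sub_asset_scores)
```

Returning the prefix alongside the score keeps track of which sub-asset drove the organization-level result.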

The final score, associated reason codes, a subset of the sub-asset scores themselves and associated reason codes, and select network or threat information calculated or retrieved in any aspect of the overall process are displayed to the end user as demonstrated in FIG. 3, which shows an example of the results for a single organization that can be delivered to the user by means of a graphical user interface. The results contain alphanumeric and/or graphical depictions of the organization's cyber risk score (301) and reason codes (302) associated with that score, an overview of its attack surface, the historical scores (303) for the organization in question, a summary (304) of malicious activities observed taking place on the organization's network, and a score (306) and description (305) for each individual sub-asset comprising the organization's network.

In some implementations, a computer-implemented method includes retrieving, by at least one data processor, a dataset comprising records collected via other means, indexed by IP address, whose index is within any of the network blocks associated with a specified organization via other means. The method further includes aggregating, by the at least one data processor, the same dataset corresponding to a set of network blocks associated with the organization, the aggregation being performed irrespective of the network blocks themselves except insofar as they identify the addresses to be used in the aggregation, resulting in overall aggregated data. The method can further include generating, by the at least one data processor, a set of overall organization features based on the overall aggregated data.

In some implementations, the method further includes aggregating, by the at least one data processor, the same dataset corresponding to a set of network blocks associated with the organization, the aggregation being performed for each individual network block separately, resulting in network aggregated data for each network block. The method can further include generating, by the at least one data processor, a set of network features, the processes of calculating which may or may not be identical to the same processes as for the features for each individual network block separately, based on the network aggregated data corresponding to each individual network block separately.

In some implementations, the method further includes aggregating, by the at least one data processor, demographic and firmographic data related to an organization, the aggregation generating aggregated organizational non-network data for each organization. The method can further include generating, by the at least one data processor, a set of non-network organizational features (such as SIC risk, geographical risk, and network ownership data as reflected by internet governing organizations) for each organization, based on the aggregated organizational non-network data for each organization. The method can further include aggregating, by the at least one data processor, the dataset corresponding to dark web intelligence data associated with the organization, the aggregation being performed for both the organization and each individual network block separately, resulting in network aggregated data for each network block.

In yet further implementations, a method includes generating, by the at least one data processor, a set of dark web intelligence data associated with each individual network block separately and with the organization, based on “dark web” intelligence data associated with the organization. The term “dark web” refers to content that is hidden from normal internet queries and requires special access methods, and that is typically used to store stolen intelligence, user credentials, and valuable data such as payment card data or personally identifiable information (PII). The method can further include calculating, by the at least one data processor, an overall odds score and ranked list of overall reason codes based on the overall generated features, the odds score indicating the level of cybersecurity breach risk associated with the organization, and the ranked list of reason codes indicating the specific factors that contributed most strongly to the odds score. The method can further include calculating a set of network odds scores and ranked lists of network reason codes based on the network generated features corresponding to each individual network block separately, the network odds scores indicating the level of cybersecurity breach risk associated with the network block, and the ranked lists of network reason codes indicating the specific factors that contributed most strongly to the network odds score for the corresponding network block.

In yet other implementations, a method can include calculating, by the at least one data processor, a final odds score and ranked list of reason codes based on the overall odds score and associated overall reason codes and the set of all network odds scores and associated network reason codes, the final odds score indicating the level of cybersecurity breach risk associated with the organization. The method can further include selecting, by the at least one data processor, a subset of network scores, network reason codes, and associated metadata to be used in augmenting the information displayed along with the final cyber breach odds score, the selection being implemented by a predefined process predicated on any or all of the features, scores, reason codes, or additional characteristics. A method in accordance with some implementations can include a step of constructing, by the at least one data processor, a final Score Response consisting of the final cyber breach odds score, associated reason codes, and, optionally, fields derived from the same selected subset.
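One way to picture the final Score Response is as a small record assembled from the overall result and the per-network results. The sketch below is illustrative only: the class and field names are hypothetical, reason codes are ranked by a made-up per-feature contribution, the final score is taken as the highest of the overall and network scores, and the selection rule keeps the riskiest networks. The description leaves each of these choices open.

```python
from dataclasses import dataclass, field


@dataclass
class NetworkResult:
    prefix: str
    odds_score: float
    reason_codes: list


@dataclass
class ScoreResponse:
    final_odds_score: float
    reason_codes: list
    selected_networks: list = field(default_factory=list)


def rank_reason_codes(contributions, top_n=3):
    """Rank feature names by their (hypothetical) contribution to a score."""
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:top_n]]


def build_score_response(overall_score, overall_codes, networks, max_networks=2):
    """Assemble a response; assumes a higher odds score means higher risk."""
    riskiest_first = sorted(networks, key=lambda n: n.odds_score, reverse=True)
    final_score = max([overall_score] + [n.odds_score for n in networks])
    # Report the reason codes of whichever result produced the final score.
    if riskiest_first and riskiest_first[0].odds_score > overall_score:
        codes = riskiest_first[0].reason_codes
    else:
        codes = overall_codes
    return ScoreResponse(final_score, codes, riskiest_first[:max_networks])


overall_codes = rank_reason_codes({"sic_risk": 0.2, "expired_certs": 0.5})
networks = [
    NetworkResult("192.17.0.0/16", 0.40, ["open_services"]),
    NetworkResult("206.123.112.224/27", 0.90, ["expired_certs"]),
]
response = build_score_response(0.55, overall_codes, networks, max_networks=1)
```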

One or more aspects or features of the subject matter described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), computer hardware, firmware, software, and/or combinations thereof. These various aspects or features can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. The programmable system or computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

These computer programs, which can also be referred to as programs, software, software applications, applications, components, or code, include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the term “machine-readable medium” refers to any computer program product, apparatus and/or device, such as for example magnetic disks, optical disks, memory, and Programmable Logic Devices (PLDs), used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor. The machine-readable medium can store such machine instructions non-transitorily, such as for example as would a non-transient solid-state memory or a magnetic hard drive or any equivalent storage medium. The machine-readable medium can alternatively or additionally store such machine instructions in a transient manner, such as for example as would a processor cache or other random access memory associated with one or more physical processor cores.

To provide for interaction with a user, one or more aspects or features of the subject matter described herein can be implemented on a computer having a display device, such as for example a cathode ray tube (CRT), a liquid crystal display (LCD) or a light emitting diode (LED) monitor for displaying information to the user and a keyboard and a pointing device, such as for example a mouse or a trackball, by which the user may provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback, such as for example visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including, but not limited to, acoustic, speech, or tactile input. Other possible input devices include, but are not limited to, touch screens or other touch-sensitive devices such as single or multi-point resistive or capacitive trackpads, voice recognition hardware and software, optical scanners, optical pointers, digital image capture devices and associated interpretation software, and the like.

The subject matter described herein can be embodied in systems, apparatus, methods, and/or articles depending on the desired configuration. The implementations set forth in the foregoing description do not represent all implementations consistent with the subject matter described herein. Instead, they are merely some examples consistent with aspects related to the described subject matter. Although a few variations have been described in detail above, other modifications or additions are possible. In particular, further features and/or variations can be provided in addition to those set forth herein. For example, the implementations described above can be directed to various combinations and subcombinations of the disclosed features and/or combinations and subcombinations of several further features disclosed above. In addition, the logic flows depicted in the accompanying figures and/or described herein do not necessarily require the particular order shown, or sequential order, to achieve desirable results. Other implementations may be within the scope of the following claims.

1.-19. (canceled)
20. A system comprising: at least one programmable processor; and a machine-readable medium storing instructions that, when executed by the at least one programmable processor, cause the at least one programmable processor to perform operations comprising: generating a set of features based on a plurality of datasets, the set of features indicative of cybersecurity data breach risks of a computer network associated with an organization, the plurality of datasets corresponding to a plurality of network prefixes of the computer network associated with the organization, the plurality of datasets comprising an IP address mapped to the network prefixes and comprising data representative of responses by the network prefixes to various requests at the IP address; calibrating, in response to generating the set of features, the set of features with a quantile estimate from a database, the quantile estimate being calibrated by comparing cybersecurity breach risks across various organizations; and scoring, based on the calibrated set of features, the plurality of datasets as an overall score, the scoring processed according to a cybersecurity breach risk scoring model executed by the at least one programmable processor on the plurality of datasets.
21. The system of claim 20, wherein the quantile estimate corresponds to at least one feature of the set of features indicative of a corresponding cybersecurity data breach risk of the computer network associated with the organization.
22. The system of claim 20, wherein the quantile estimate is determined based on a data segment, the data segment corresponding to at least one of an industry, a number of employees in the organization, an annual revenue, a number of physical locations, a number of vendor relationships, and geographical location.
23. The system of claim 20, wherein the operations further comprise: retrieving the plurality of datasets by scanning the plurality of network prefixes of the computer network associated with the organization over a period of time; aggregating the plurality of datasets corresponding to the plurality of network prefixes of the computer network associated with the organization into an aggregated computer network dataset, the plurality of datasets further comprising a mutually-disjoint set of IP address blocks mapped to the network prefixes and further comprising additional data representative of responses by the network prefixes to the various requests at the mutually-disjoint set of IP address blocks; and scoring, based on the calibrated set of features, the aggregated computer network dataset, the scoring being processed according to the cybersecurity breach risk scoring model executed by the at least one programmable processor on the aggregated computer network dataset.
24. The system of claim 23, wherein the operations further comprise: updating, in response to aggregating the plurality of datasets, the quantile estimate at the database based on the aggregated computer network dataset.
25. The system of claim 20, wherein the cybersecurity breach risk scoring model is based on historical data related to at least one of the network prefixes, and wherein the historical data is tagged with breach versus no-breach incidents.
26. The system of claim 20, wherein the plurality of datasets further comprises a plurality of historical network observations corresponding to the set of features indicative of cybersecurity data breach risks for the plurality of network prefixes.
27. The system of claim 20, wherein the scoring weighs one or more riskiest of the network prefixes more heavily to generate an overall organization score, the one or more riskiest of the network prefixes being determined based on a threshold score determined by the at least one programmable processor according to the cybersecurity breach risk scoring model.
28. The system of claim 20, wherein the operations further comprise: aggregating a first set of records in the plurality of datasets, wherein the first set of records is associated with organizational features to calculate a first subset of the one or more features for the plurality of network prefixes; aggregating a second set of records in the plurality of datasets, wherein the second set of records is associated with network features of each network block of the plurality of network blocks to calculate a second subset of the one or more features indicative of a cybersecurity breach risk for the plurality of network prefixes; calculating, according to an analytic organization model, an organizational cybersecurity breach probability score based on the first subset calculated by aggregating the records associated with the organizational features; calculating, according to a network model, a network cybersecurity breach probability score based on the quantile estimate and the second subset calculated by aggregating the records associated with the network features; and combining the organizational cybersecurity breach probability score and the network cybersecurity breach probability score with the overall score.
29. The system of claim 28, wherein the analytic organization model is based on historical data related to one or more of the plurality of network prefixes and wherein the network model is based on the historical data related to one or more of the plurality of network prefixes, and wherein the historical data is tagged with breach versus no-breach incidents.
30. A computer-implemented method comprising: generating, by at least one data processor, a set of features based on a plurality of datasets, the set of features indicative of cybersecurity data breach risks of a computer network associated with an organization, the plurality of datasets corresponding to a plurality of network prefixes of the computer network associated with the organization, the plurality of datasets comprising an IP address mapped to the network prefixes and comprising data representative of responses by the network prefixes to various requests at the IP address; calibrating, by the at least one data processor and in response to generating the set of features, the set of features with a quantile estimate from a database, the quantile estimate being calibrated by comparing cybersecurity breach risks across various organizations; and scoring, by the at least one data processor and based on the calibrated set of features, the plurality of datasets as an overall score, the scoring processed according to a cybersecurity breach risk scoring model executed by the at least one data processor on the plurality of datasets.
31. The method of claim 30, wherein the quantile estimate corresponds to at least one feature of the set of features indicative of a corresponding cybersecurity data breach risk of the computer network associated with the organization.
32. The method of claim 30, wherein the quantile estimate is determined based on a data segment, the data segment corresponding to at least one of an industry, a number of employees in the organization, an annual revenue, a number of physical locations, a number of vendor relationships, and geographical location.
33. The method of claim 30, further comprising: retrieving, by the at least one data processor, the plurality of datasets by scanning the plurality of network prefixes of the computer network associated with the organization over a period of time; aggregating, by the at least one data processor, the plurality of datasets corresponding to the plurality of network prefixes of the computer network associated with the organization into an aggregated computer network dataset, the plurality of datasets further comprising a mutually-disjoint set of IP address blocks mapped to the network prefixes and further comprising additional data representative of responses by the network prefixes to the various requests at the mutually-disjoint set of IP address blocks; and scoring, by the at least one data processor and based on the calibrated set of features, the aggregated computer network dataset, the scoring being processed according to the cybersecurity breach risk scoring model executed by the at least one data processor on the aggregated computer network dataset.
34. The method of claim 33, further comprising: updating, by the at least one data processor and in response to aggregating the plurality of datasets, the quantile estimate at the database based on the aggregated computer network dataset.
35. A computer program product comprising a non-transitory machine-readable medium storing instructions that, when executed by at least one programmable processor, cause the at least one programmable processor to perform operations comprising: generating a set of features based on a plurality of datasets, the set of features indicative of cybersecurity data breach risks of a computer network associated with an organization, the plurality of datasets corresponding to a plurality of network prefixes of the computer network associated with the organization, the plurality of datasets comprising an IP address mapped to the network prefixes and comprising data representative of responses by the network prefixes to various requests at the IP address; calibrating, in response to generating the set of features, the set of features with a quantile estimate from a database, the quantile estimate being calibrated by comparing cybersecurity breach risks across various organizations; and scoring, based on the calibrated set of features, the plurality of datasets as an overall score, the scoring processed according to a cybersecurity breach risk scoring model executed by the at least one programmable processor on the plurality of datasets.
36. The computer program product of claim 35, wherein the quantile estimate corresponds to at least one feature of the set of features indicative of a corresponding cybersecurity data breach risk of the computer network associated with the organization.
37. The computer program product of claim 35, wherein the quantile estimate is determined based on a data segment, the data segment corresponding to at least one of an industry, a number of employees in the organization, an annual revenue, a number of physical locations, a number of vendor relationships, and geographical location.
38. The computer program product of claim 35, wherein the operations further comprise: retrieving the plurality of datasets by scanning the plurality of network prefixes of the computer network associated with the organization over a period of time; aggregating the plurality of datasets corresponding to the plurality of network prefixes of the computer network associated with the organization into an aggregated computer network dataset, the plurality of datasets further comprising a mutually-disjoint set of IP address blocks mapped to the network prefixes and further comprising additional data representative of responses by the network prefixes to the various requests at the mutually-disjoint set of IP address blocks; and scoring, by the at least one programmable processor and based on the calibrated set of features, the aggregated computer network dataset, the scoring being processed according to the cybersecurity breach risk scoring model executed by the at least one programmable processor on the aggregated computer network dataset.
39. The computer program product of claim 38, wherein the operations further comprise: updating, in response to aggregating the plurality of datasets, the quantile estimate at the database based on the aggregated computer network dataset.